ABSTRACT

H-InvDB (http://www.h-invitational.jp/) is a comprehensive human gene
database started in 2004. In the latest version, H-InvDB 8.0, a total of
244 709 human complementary DNA sequences were mapped onto the hg19
reference genome, and 43 829 gene loci, including nonprotein-coding ones,
were identified. Of these loci, 35 631 were identified as potential
protein-coding genes, and 22 898 of these were identical to known genes.
In our analysis, 19 309 annotated genes were specific to H-InvDB and were
not found in RefSeq or Ensembl. Of these 19 309 genes, 233 that had been
annotated as proteins of unknown function in the previous version turned
out to have protein functions in this version of H-InvDB. Furthermore,
11 genes were identified as known Mendelian disorder genes. These results
suggest that many biologically functional genes are hidden among the
H-InvDB unique genes, which is an advantage of the database. As
large-scale proteomic projects have been conducted to elucidate the
functions of all human proteins, we have enhanced the proteomic
information with an advanced protein view and a new subdatabase of
protein complexes (Protein Complex Database with quality index). We
propose that H-InvDB is an important resource for finding novel candidate
targets for medical care and drug development.

INTRODUCTION 

Along with the sequencing of the first human reference genome (1),
several lines of human transcriptome study using a large number of
validated human transcripts were carried out. As full-length
complementary DNA (cDNA) is the ideal resource for such study, our
consortium aimed to collect human full-length cDNA sequenced by four
projects: Full-Length cDNA Japan (FLJ) (2), Human Unidentified
Gene-Encoded Large Proteins (HUGE) (3), Mammalian Gene Collection (MGC)
(4) and Munich Information Centre for Protein Sequences (MIPS) (5).
These projects were conducted at five institutions: New Energy and
Industrial Technology Development Organization (NEDO), Kazusa DNA
Research Institute (KDRI), the National Institutes of Health (NIH, USA),
German Research Centre for Environment and Health (GSF) and Chinese
National Human Genome Centre (CHGC) (6). Our consortium then held an
international workshop called Human Full-Length cDNA Annotation
Invitational (H-Invitational or H-Inv), at which expert scientists and
annotators manually annotated the registered human full-length cDNA
sequences on our annotation system (7). To release the annotation
results, the first H-InvDB was constructed in 2004, and as of the third
version in 2006, H-InvDB was extended to include all published human
cDNA in addition to H-Inv human full-length cDNA (8).

At present, H-InvDB has been developed as not only a human transcriptome
database but also one of the largest integrative human omics databases
available to human gene researchers in various biological fields. One of
the features of H-InvDB is that all published human cDNA sequences were
annotated by a rigorous annotation pipeline confirmed at H-Invitational
(7,9). For example, we examine sequence quality, sequence identity with
the human reference genome sequence, sequence orientation (some cDNA
sequences are registered in reverse direction), chimeric or truncated
cDNAs and possible contamination from other species. Thus, most
artifacts were removed and misannotations were expected to be few.
H-InvDB also contains several specific H-Inv sub and satellite databases
based on the annotation of H-Inv human transcripts (Figure 1). Databases
involving gene expression (H-ANGEL) (10), molecular evolution (Evola)
(11), genetic polymorphism (VarySysDB) (12) and alternative splicing
(H-DBAS) (13) have been developed. Thus, users can find objective human
annotation information in diverse combinations by using the search
system of H-InvDB. In addition to these databases, H-InvDB is also
connected with external databases by the web service application program
interfaces (APIs) and Hyperlink Management System (HMS) (14). On these
accounts, H-InvDB is a reliable and useful database for omics studies.

Figure 1. A schematic diagram of H-InvDB as a central hub for human
omics study. Each content is described briefly in the Quick guide page
(http://h-invitational.jp/hinv/ahg-db/tools.jsp).

Table 1

Number of gene clusters (HIX)   Number of transcripts (HIT)   Number of proteins (HIP)
43 829                          244 709                       147 684

Table 2. Statistics of representative HIPs

Category  Definition                                                           Number of representative HITs
I         Identical to known human protein (>98% identity and 100% coverage)   16 128
II        Similar to known protein (>50% identity and >50% coverage)            5 872
III       InterPro domain containing protein                                      898
IV        Conserved hypothetical protein                                        1 705
V         Hypothetical protein                                                  5 268
VI        Hypothetical short protein (20-79 amino acids)                        5 068
VII       Pseudogene candidates                                                   692
Total                                                                          35 631



CHARACTERISTICS OF H-InvDB RELEASE 8.0 

Update information 

In the latest version, H-InvDB 8.0, 244 709 human transcript sequences
extracted from DDBJ (15) were freshly mapped on the assembled reference
genome UCSC hg19 (16). Clustering the transcripts revealed 43 829 gene
loci called H-Inv clusters (HIXs) (Table 1). Among these 43 829 genes,
35 631 were predicted as potential protein-coding genes. This number is
much larger than the number of nonredundant protein entries in
UniProtKB/SwissProt (17), which is a literature-based, human-curated
database of known proteins, because H-InvDB contains both known and
predicted proteins from human transcripts. We classified them into seven
protein categories according to the strength of protein evidence (7) and
found that 22 898 genes were predicted to have at least one protein
functional motif (Categories I-III) (Table 2).
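
The category counts quoted above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are copied from Table 2, while the dictionary and function names are hypothetical and are not part of any H-InvDB tool.

```python
# Representative protein counts per category, copied from Table 2.
CATEGORY_COUNTS = {
    "I": 16128,
    "II": 5872,
    "III": 898,
    "IV": 1705,
    "V": 5268,
    "VI": 5068,
    "VII": 692,
}


def subtotal(categories):
    """Sum the Table 2 counts over the given category labels."""
    return sum(CATEGORY_COUNTS[c] for c in categories)


# Categories I-III are the genes with at least one protein functional motif.
functional = subtotal(["I", "II", "III"])
# Summing over all seven categories gives the potential protein-coding genes.
total = subtotal(CATEGORY_COUNTS)
print(functional, total)  # prints: 22898 35631
```

The two sums reproduce the 22 898 and 35 631 figures given in the text.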

Including all these protein categories, all H-Inv 
transcripts (HITs) were annotated with various sequence 
features, such as gene structures, alternative splicing 
variants, noncoding functional RNA, protein functions, 
functional domains, subcellular localizations, metabolic 
pathways, protein 3D structure, genetic polymorphisms 
(single-nucleotide polymorphism, indels and microsatellite 
repeats), association with diseases, gene expression 
profiling, molecular evolutionary features, protein- 
protein interactions (PPIs) and gene families/groups. 



These annotations were assigned not only to H-InvDB but also, in detail,
to the corresponding specific H-Inv sub and satellite databases. These
annotations are also used as search items in the H-InvDB Navi system (8)
for compound retrieval. Among the H-Inv satellite databases, H-InvDB
Enrichment Analysis Tool (HEAT) (8) was considerably upgraded. HEAT is a
tool for gene-set enrichment analysis based on various annotations in
H-InvDB, such as InterPro (18), GO (19), KEGG pathway (20), SCOP (21),
subcellular localization, chromosomal band, gene family and
tissue-specific expression in H-ANGEL (10). It searches for H-InvDB
annotations that are significantly enriched in a user-defined gene set
as compared with the entire set of H-InvDB representative protein-coding
transcripts. We newly added promoter motifs of all human genes based on
JASPAR (22) and PPIs to the HEAT system. This enabled us to conduct
extensive data mining with the HEAT system.
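
To illustrate what an enrichment test of this kind computes, here is a minimal sketch based on the one-sided hypergeometric test, a common choice for gene-set enrichment. The text does not state which statistic HEAT actually uses, so the test itself, the function name and the parameter names are all assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Return P(X >= hits) for a hypergeometric draw.

    population: size of the background gene set
    annotated:  background genes that carry the annotation
    sample:     size of the user-defined gene set
    hits:       genes in the user-defined set that carry the annotation
    (All names are hypothetical; HEAT's own statistic is not described here.)
    """
    upper = min(annotated, sample)
    denominator = comb(population, sample)
    numerator = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return numerator / denominator


# Toy numbers: 10 background genes, 3 annotated, 4 selected, 2 of them annotated.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=4, hits=2)
```

A small p-value indicates that the annotation occurs in the gene set more often than expected by chance. In practice, one such test would be run per annotation term and the results corrected for multiple testing.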

Advantages of H-InvDB 

We compared 43 829 H-Inv genes with RefSeq (23) and 
Ensembl (24) genes to enumerate the numbers of unique 
and overlapping entries. Although the numbers of H-Inv 
unique genes were similar to those of the Ensembl unique 
ones (19 309 and 19 063, respectively) (Figure 2A), 
H-InvDB uses only rigorously annotated human cDNA 
sequences, including those of experimentally validated 
full-length cDNA (7). These characteristics suggest that 
our uniquely annotated genes were likely to be biologically 
functional. To investigate the evidence for protein coding 
of H-Inv unique genes, we also compared frequencies of 
the genes in protein categories between H-InvDB and the 
consensus coding sequence (CCDS) (25) (Figure 2B). The 
result indicated that proteins of unknown function (Categories V and VI)
and nonprotein-coding sequences were frequent in H-Inv unique genes. As
described earlier, these H-Inv unique unknown proteins are derived from
sequences that were actually transcribed, so they may indeed have some
functions. In fact, 233 genes that had been classified as hypothetical
proteins (Categories V and VI) in the previous version, H-InvDB 6.2,
turned out to be functional proteins (Categories I-III) in the latest
version, H-InvDB 8.0, because they were found in Online Mendelian
Inheritance in Man (OMIM) (26) (Table 3). Among them, 11 Category I
genes were suggested to be associated with Mendelian disorders based on
OMIM (Supplementary Table S1). Two of the 11 genes were annotated as
Waldenstrom's macroglobulinemia susceptibility genes and another two
were annotated as psoriasis susceptibility genes. In addition, 11 genes
that had been classified as hypothetical proteins (Categories V and VI)
in H-InvDB 6.2 turned out to be noncoding RNA candidates (Supplementary
Table S2). Four of these genes were annotated as similar to functional
noncoding RNAs.

Figure 2. Comparison of gene numbers between H-InvDB and other
databases. (A) The Venn diagram represents the numbers of unique and
overlapping genes among H-InvDB, RefSeq and Ensembl. (B) The bar graph
represents the numbers of H-Inv unique genes when compared with CCDS
genes. The roman numerals indicate protein categories shown in Table 2.
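
The enumeration of unique and overlapping entries among three databases can be sketched with plain set operations. This sketch assumes that genes from the three sources have already been matched to shared locus keys. How that matching was done is not described here, and the function name and toy identifiers are hypothetical.

```python
def venn_counts(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy locus keys standing in for matched H-InvDB, RefSeq and Ensembl genes.
hinv = {"g1", "g2", "g3", "g4"}
refseq = {"g2", "g3", "g5"}
ensembl = {"g3", "g4", "g6", "g7"}
regions = venn_counts(hinv, refseq, ensembl)
```

The seven region sizes always sum to the size of the union of the three sets, which is a convenient sanity check on any such comparison.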

Table 3. Protein category-upgraded genes related to Mendelian disorders
only in H-InvDB 8.0

Category (a)                  Number of category-upgraded genes
Upgrade from V or VI to I      11
Upgrade from V or VI to II    209
Upgrade from V or VI to III    13

(a) Definition of category is shown in Table 2.

New features

We had originally developed annotation viewers for transcriptomes and
genomes, called 'Transcript view' and 'Locus view', respectively. In
addition to these viewers, we provided a new viewer named 'Protein view'
for the annotations of the human proteome (Figure 3A). In Protein view,
annotation information of H-Inv proteins (HIPs) is provided.
Furthermore, through the web service APIs, a link to Glycoprotein
DataBase (27) is added and the glycosylation site is illustrated in the
figure of Protein view. As human cDNA clones are necessary for protein
expression experiments, we added links to human cDNA clone databases
such as NITE Biological Resource Center (NBRC) and Human Gene and
Protein Database (HGPD) (28), which are connected by HMS.

Figure 3. Screenshot of a part of Protein view and the top page of PCDq.
(A) Hyperlinks to NBRC and HGPD are shown in a red circle. (B) Entrance
to PCDq is http://www.h-invitational.jp/hinv/pcdq/.



Using these links, users can access the databases and obtain actual
human cDNA clones for various experiments. A new subdatabase was also
constructed and connected with H-InvDB. This new subdatabase, called
Protein Complex Database with quality index (PCDq) (29), is a human
protein complex database with a complex quality index, which describes
the level of evidence that each protein is a subunit (member) of the
protein complex. From the human PPI network dataset integrated from six
PPI data sources, human protein complexes were predicted and curated
with the literature. Thus, PCDq consists of both known and predicted
complexes/subunits (Figure 3B). PCDq is expected to enable users to
investigate protein interactions in more detail by protein subunit
rather than whole protein.
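
As an illustration of what integrating several PPI data sources into one network dataset can involve, the sketch below merges interaction lists into nonredundant, undirected pairs and records which sources support each pair. The function name, source labels and the decision to drop self-interactions are assumptions made for this example only. The actual PCDq integration and complex prediction procedures are described in reference (29), not here.

```python
from collections import defaultdict


def integrate_ppi(sources):
    """Merge PPI lists from several sources into undirected, nonredundant pairs.

    sources: mapping of source name -> iterable of (protein, protein) tuples.
    Returns a dict mapping each unordered pair (a frozenset) to the set of
    source names that report it.
    """
    merged = defaultdict(set)
    for name, pairs in sources.items():
        for p, q in pairs:
            if p == q:
                continue  # assumption: self-interactions are ignored in this sketch
            merged[frozenset((p, q))].add(name)
    return dict(merged)


# Toy input: two hypothetical sources reporting overlapping interactions.
toy_sources = {
    "db1": [("A", "B"), ("B", "C")],
    "db2": [("B", "A"), ("C", "C")],
}
network = integrate_ppi(toy_sources)
```

Pairs supported by more than one source, such as A-B in the toy input, could then be given more weight when judging the evidence for a complex member.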



FUTURE PERSPECTIVES 

At present, the identification of all human proteins is proceeding
worldwide. H-InvDB will continue to offer tools for proteome studies.
For example, we are now collecting information on posttranslational
modification. Using feedback from various experimental results at the
protein level, we intend to develop H-InvDB as the best central hub for
human omics study. In addition, personal genome annotation, such as the
prediction of disease susceptibility using individual gene mutations,
will be in great demand. Therefore, we intend to expand into the field
of personal genomics in future. In addition to the web service APIs of
the present H-InvDB, we will provide annotation data in the Resource
Description Framework (RDF) (http://www.w3.org/RDF/). We aim to improve
the efficiency of accessing molecular biological data by integrating
international databases in a more sophisticated manner using this
semantic web technology.



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1 and 2. 